Tajik-Farsi Persian Transliteration Using Statistical Machine Translation

نویسنده

  • Chris Irwin Davis
چکیده

Tajik Persian is a dialect of Persian spoken primarily in Tajikistan and written with a modified Cyrillic alphabet. Iranian Persian, or Farsi, as it is natively called, is the lingua franca of Iran and is written with the Persian alphabet, a modified Arabic script. Although the spoken versions of Tajik and Farsi are mutually intelligible to educated speakers of both languages, the difference between the writing systems constitutes a barrier to text compatibility between the two languages. This paper presents a system to transliterate text between these two different Persian dialects that use incompatible writing systems. The system also serves as a mechanism to facilitate sharing of computational linguistic resources between the two languages. This is relevant because of the disparity in resources for Tajik versus Farsi.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The Amirkabir Machine Transliteration System for NEWS 2011: Farsi-to-English Task

In this paper we describe the statistical machine transliteration system of Amirkabir University of Technology, developed for NEWS 2011 shared task. This year we participated in English to Persian language pair. We use three systems for transliteration: the first system is a maximum entropy model with a new proposed alignment algorithm. The second system is Sequitur g2p tool, an open source gra...

متن کامل

Benchmarking SMT Performance for Farsi Using the TEP++ Corpus

Statistical machine translation (SMT) suffers from various problems which are exacerbated where training data is in short supply. In this paper we address the data sparsity problem in the Farsi (Persian) language and introduce a new parallel corpus, TEP++. Compared to previous results the new dataset is more efficient for Farsi SMT engines and yields better output. In our experiments using TEP+...

متن کامل

Statistical Machine Translation as a Grammar Checker for Persian Language

Existence of automatic writing assistance tools such as spell and grammar checker/corrector can help in increasing electronic texts with higher quality by removing noises and cleaning the sentences. Different kinds of errors in a text can be categorized into spelling, grammatical and real-word errors. In this article, the concepts of an automatic grammar checker for Persian (Farsi) language, is...

متن کامل

Named Entity Transliteration Generation Leveraging Statistical Machine Translation Technology

Automatically identifying that different orthographic variants of names are referring to the same name is a significant challenge for processing natural language processing since they typically constitute the bulk of the out-of-vocabulary tokens. The problem is exacerbated when the name is foreign. In this paper we address the problem of generating valid orthographic variants for proper names, ...

متن کامل

Statistical Approach to Transliteration from English to Punjabi

-Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Transliteration is a crucial factor in CLIR and MT. It is important for Machine Translation, especially when the languages do not use the same scripts. This paper addresses the issue of statistical mach...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012